QuickTrap
Volume Number: 4
Issue Number: 3
Column Tag: The Mac Hacker
QuickTrap Routines Bypass Trap Dispatcher
By Mike Morton, University of Hawaii
Bypassing the ROM trap dispatcher
In an article a while back, I covered the basics of bypassing the Macintosh trap
dispatcher to call ROM routines directly, to speed up calls to the Toolbox and OS. In
this article, I’ll present a set of subroutines which implement this technique in a
practical way.
The package is written in MPW assembler, and should be easily callable from any
of the MPW languages. It’s short and should be portable to other development systems.
It also includes a “fail-soft” feature, in case it turns out not to work on some future
Macintosh.
A quick review
Programs call the Macintosh Toolbox and Operating System routines by executing
“illegal” instructions, which are handed to the trap dispatching code in the ROM. In
addition to the time it takes for the 680x0 processor to recover from the emotional
trauma of this illegal instruction, the dispatcher must fetch the offending instruction,
decode it, and call the routine it specifies. This is very general, since it “hides” ROM
locations from the application, but it’s also slow.
With the GetTrapAddress routine, you can calculate the address of a ROM routine
just once each time your application runs. Calling that address directly can save you a
lot of time, with very little cost in generality.
What does the dispatcher do?
Here’s the code for the dispatcher in my MacPlus ROM. Your Mac may have
something a little different, but all existing Macs seem to be similar in principle. The
dispatcher, at address $401F52 in my ROM, disassembles to:
disp:
SUBQ.L #2, SP ; add 2 bytes above CCR
MOVEM.L D1-D2/A2, -(SP) ; save 12 bytes of regs
MOVE.L 12+4(SP), A2 ; get PC of trap word
MOVE.W (A2)+, D2 ; get A-trap word
MOVE.L A2, 12+4(SP) ; restore updated PC
MOVE.W D2, D1 ; copy trap word to D1
ANDI.W #$01FF, D2 ; get just trap number
CMPI.W #$A800, D1 ; Toolbox or OS?
BLO.S doOS ; jump if OS
LEA $0C00, A2 ; point->Toolbox dispatch
LSL.W #2, D2 ; scale number->longwords
MOVE.L (A2,D2.W), 12(SP) ; copy address to stack
CMPI.W #$AC00, D1 ; “auto-pop” bit set?
MOVEM.L (SP)+, D1-D2/A2 ; restore regs; leave CCR
BLO.S callTB ; skip if “auto-pop” off
MOVE.L (SP)+, (SP) ; RTS to caller, not glue
callTB: RTS ; “call” Toolbox routine
doOS:
LEA $0400, A2 ; point to OS dispatch
BCLR #8, D2 ; clear&test “keep A0” bit
BNE.S OSa0 ; skip to allow A0 returned
LSL.W #2, D2 ; scale number->longwords
MOVE.L (A2,D2.W), A2 ; fetch OS routine address
MOVEM.L A0-A1, -(SP) ; save regs (incl A0)
JSR (A2) ; call OS routine
MOVEM.L (SP)+, A0-A1 ; and restore OS regs
OSrt:
MOVEM.L (SP)+, D1-D2/A2 ; restore OUR regs
ADDQ.W #4, SP ; ignore stacked CCR
TST.W D0 ; preset CCR on result
RTS ; and return
OSa0:
LSL.W #2, D2 ; scale number->longwords
MOVE.L (A2,D2.W), A2 ; fetch OS routine address
MOVE.L A1, -(SP) ; preserve A1, *not* A0
JSR (A2) ; call OS routine
MOVE.L (SP)+, A1 ; and restore A1
BRA.S OSrt ; clean up with common code
[An aside: This is the first piece of ROM code I ever read, and I still think it’s a
great example of tight 68000 coding. It’s tighter on the Mac II, with indirect
addressing available. I can’t see any way to make it faster; can anyone spot a way to
save a few bytes, though?]
Besides figuring out which routine to call (using the Toolbox dispatch table at
$0C00 or OS table at $0400), the dispatcher also does some other important things.
For Toolbox traps, it discards the return address if the “auto-pop” bit is set -- this
is useful for “glue”. And for OS traps, it preserves D1, D2, A1 and A2, and sometimes
A0. For OS traps, it also passes the trap word to the routine, in D1.
Our task is to make a trap “dispatcher” which does all this, but is much faster.
Note, for instance, that the new code must still pass the trap word in D1.w -- I
believe this is how some routines test for flag bits set in the word. (For instance,
CmpString has a bit to specify if the comparison is case-sensitive.)
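The decoding steps in the listing can be modeled in a few lines. This sketch is in Python rather than 68000 code, purely so the bit tests are easy to check. The function name decode_trap and the dictionary keys are my own inventions; the masks, comparisons and table bases are copied from the disassembly above.

```python
# Model of the trap-word decoding done by the dispatcher listing.
# Names are illustrative only; constants come from the disassembly.

TOOLBOX_TABLE = 0x0C00   # LEA $0C00, A2
OS_TABLE = 0x0400        # LEA $0400, A2

def decode_trap(word):
    """Split a 16-bit A-line trap word the way the listing does."""
    number = word & 0x01FF                      # ANDI.W #$01FF, D2
    if word >= 0xA800:                          # CMPI.W #$A800, D1 / BLO.S doOS
        return {
            "kind": "toolbox",
            "number": number,
            "entry": TOOLBOX_TABLE + 4 * number,  # LSL.W #2, D2
            "auto_pop": word >= 0xAC00,           # CMPI.W #$AC00, D1
        }
    returns_a0 = bool(number & 0x0100)          # BCLR #8, D2 tests bit 8...
    number &= 0x00FF                            # ...and clears it
    return {
        "kind": "os",
        "number": number,
        "entry": OS_TABLE + 4 * number,
        "returns_a0": returns_a0,               # set: A0 is not saved/restored
        "d1": word,                             # D1 still holds the whole word
    }
```

For instance, decode_trap(0xA873) (the word usually listed for SetPort) reports a Toolbox trap, number $73, table entry $0DCC, auto-pop off. decode_trap(0xA146) reports an OS trap, number $46, table entry $0518, with the bit-8 flag set. Check the trap words against your own trap list before relying on them.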
Hey, wait a minute! Isn’t it a bad idea to know how one ROM routine (the
dispatcher) communicates with all the others? Isn’t code which depends on this
interface likely to fall apart when the Mac III hits the streets? Well, first of all, it’d
be awfully hard for Apple to change hundreds of routines. But more importantly,
there’s a way to back out gracefully. Trust me; we’ll get to it.
An application’s view of the QuickTrap routines
The fundamental speedup is to get rid of the dispatcher, and have one “quick
trap” routine for every real routine you’d like fast access to. For instance, if your
program does a lot of SetPort calls, you can easily create “qtSetPort”, which has
exactly the same interface and does the same thing, only faster. As you might guess,
each qtxxx routine caches the address for its routine.
Once, at the beginning of your application, you must call qtEval, which
“evaluates” each address and stores it. If you don’t call it, everything will still work
-- this is related to the fail-soft scheme.
Other than this, everything works the same as old-style trap routines.
Caching problems
Imagine that you spend a lot of time doing FrameOval calls to draw circles on the
screen, and would like to speed this up. (Actually, I’m sure the trap time is
insignificant compared to the drawing time; this is just an example.) You install
“qtFrameOval” and call it instead; everything works great.
Now your friend gives you this neat, public-domain desk accessory which causes
all ovals to be drawn on your screen with smile-faces in them. [Any takers to write
this, by the way? You could call it The Smiling Moose.] It does this by altering the
FrameOval trap to call it. But since your application never executes that trap, its
ovals are drawn unmolested. How can you make sure your ovals are happy?
The answer is to call qtEval at the right times -- not just at initialization but
whenever you suspect someone has installed a replacement trap routine. Since the
qtxxx routines are supposed to “cache” the real addresses, they must track new
addresses when they’re installed, or the cache becomes “stale”.
One way to do this is to call qtEval every time you regain control from a desk
accessory, each time you regain control from Switcher or MultiFinder, and each time
you invoke an FKEY. Perhaps you’d also have to call it for every SystemTask call. And
of course you must call it if your application does any SetTrapAddress calls for the
relevant traps. In short, whenever anyone could have changed trap addresses, refresh
the cache.
A simpler approach is to change the SetTrapAddress trap by installing a prefix
routine which sets a flag in your globals that re-evaluation is needed. If DAs, FKEYs,
etc., play by the rules and use SetTrapAddress calls, nobody can make the trap tables
get out of sync with your cached addresses.
It’s tempting to call qtEval in your idle-loop as a heavy-handed way to make sure
it’s done often enough. I suspect this is a bad idea -- it can cause seemingly random
bugs.
One other way: if you use, for instance, qtFrameOval only in some code which
doesn’t relinquish control, call qtEval once before each time you enter that code.
Remember that qtEval isn’t all that speedy -- it must call GetTrapAddress for every
qtxxx routine.
Reasons not to use these routines
Because the routines are called with a JSR, each call takes up four bytes instead of two. This is
no big problem for most applications, but don’t change all your calls.
When you’re debugging, commands to break on traps don’t work, since your
application is not executing trap instructions. You can force these traps to occur by
disabling the caching; see below for details.
The routines use impure code. You must make sure you put them in a segment
which is locked in memory.
Which traps should you replace?
Remember that many traps take so much time that the dispatch isn’t worth
improving. Others do next to nothing, and speed up a lot. In early use of these routines
at Lotus, we estimated about thirty routines were worth replacing. In the OS world,
things like BlockMove and UprString were included. Routines which just twiddled
handles are also important, like HLock, HUnlock, HPurge, HNoPurge, and
GetHandleSize. Among the Toolbox routines, things like MoveTo and SetPort seemed to
help.
Even if a routine is slow, it may be worth tweaking if it’s called a lot. We got
measurable improvements substituting for CharWidth, DrawString, StringWidth, and
SystemTask.
You can also replace package calls, which is kind of a pain. If you want to change
all the FP68K traps to qtFP68K, you have to change Apple’s include files, since each of
the SANE macros invokes the trap. Another solution is to just redefine FP68K to be a
macro to JSR to the qtxxx routine. But then you have to define a trap like “myFP68K”
which still expands to the A-line trap -- this is because the qtxxx routine must have a
copy of the trap word.
How much does it help?
As the TV diet ads say, results vary directly with how closely you stick to the
plan. Average performance in a large Macintosh product at Lotus was improved by
about 5%. A couple of heavily CPU-bound loops were improved by 15%. These aren’t
huge gains, but considering that they took only a day or so of work to install in a very
large program, they’re pretty good.
When does the warranty run out?
OK, it’s time to face the music. If these routines dive directly into the ROM, they
may someday dive into ROM routines in a new machine which expect different
parameters. (For more on this topic, see Macintosh Technical Note #110.) Or even if
the ROM doesn’t change, some caching problem may come up if your application’s
users use some odd way of altering trap addresses and making your cache stale.
The initialization routine qtEval can be easily disabled by modifying resources.
For instance, when a user calls to complain that some FKEY or DA doesn’t work with
your application, you can quickly change a copy of the application to disable address
caching and test if that’s the problem. If it is the problem, you can either distribute
the altered application or tell power users how to edit the resources to alter the copy
they already have.
The resource used to control caching is QTRP 257. The format is simple: if the
resource is present and the first word is zero, caching is enabled. To turn off caching,
just remove this resource under ResEdit (or renumber it, to easily restore caching).
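The rule for the resource is small enough to state as a function. This is a Python sketch of the test only, with a made-up name (caching_enabled) and a made-up convention that None means "resource not found". Treating a resource shorter than one word as "disabled" is my assumption; the format description above doesn't cover that case.

```python
import struct

def caching_enabled(qtrp_data):
    """qtrp_data: bytes of the QTRP 257 resource, or None if it is absent."""
    if qtrp_data is None or len(qtrp_data) < 2:
        return False                    # absent (or too short, by assumption)
    # The first word is a big-endian 16-bit value, as on the 68000.
    (first_word,) = struct.unpack(">H", qtrp_data[:2])
    return first_word == 0              # zero means caching is enabled
```

So caching_enabled(b"\x00\x00") is true, while a removed or renumbered resource (None here) or a nonzero first word turns caching off.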